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Abstract: There is increasing interest in the problem of nonparametric re- 
gression with high-dimensional predictors. When the number of predictors 
D is large, one encounters a daunting problem in attempting to estimate a 
D-dimensional surface based on limited data. Fortunately, in many applica- 
tions, the support of the data is concentrated on a d-dimensional subspace 
with d <C D. Manifold learning attempts to estimate this subspace. Our fo- 
cus is on developing computationally tractable and theoretically supported 
Bayesian nonparametric regression methods in this context. When the sub- 
space corresponds to a locally-Euclidean Riemannian manifold, we show 
that a Gaussian process regression approach can be applied that leads to 
the minimax optimal adaptive rate in estimating the regression function 
under some conditions. The proposed model bypasses the need to estimate 
the manifold, and can be implemented using standard algorithms for pos- 
terior computation in Gaussian processes. Finite sample performance is 
illustrated in an example data analysis. 
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1. Introduction 

Dimensionality reduction in nonparametric regression is of increasing interest 
given the routine collection of high- dimensional predictors in many application 
areas. In particular, our primary focus is on the regression model 

Yi = f(Xi) + ei, ei~N(0,a 2 ), i = l,...,n, (1.1) 

where F; <E R, X; G R D , f is an unknown regression function, and is a residual 
having variance a 2 . We face problems in estimating / accurately due to the 
moderate to large number of predictors D. Fortunately, in many applications, 
the predictors have support that is concentrated near a d-dimensional subspace 
M. If one can learn the mapping from the ambient space to this subspace, the 
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dimensionality of the regression function can be reduced massively from D to 
d, so that / can be much more accurately estimated. 

There is an increasingly vast literature on the topic of subspace learning, but 
there remains a lack of approaches that allow flexible non-linear dimensionality 
reduction, are scalable computationally to moderate to large D, have theoretical 
guarantees and provide a realistic characterization of uncertainty. Regarding this 
last point, we would like to be able to characterize uncertainty in estimating the 
regression function /, in functionals of / of interest and in predictions. Typical 
two-stage approaches in which one conducts dimensionality reduction in a first 
stage, and then plugs the ^-dimensional features into a next stage regression 
may provide a point estimate with good properties but do not characterize 
uncertainty in this estimate. 

With this motivation, we focus on Bayesian nonparametric regression meth- 
ods that allow M. to be an unknown Riemannian manifold. One natural direction 
is to choose a prior to allow uncertainty in A"f, while also placing priors on the 
mapping from xi to M., the regression function relating the lower-dimensional 
features to the response, and the residual variance. Some related attempts have 
been made in the literature. [29] propose a logistic Gaussian process model, 
which allows the conditional response density f(y\x) to be unknown and chang- 
ing flexibly with x, while reducing dimension through projection to a linear 
subspace. Their approach is elegant and theoretically grounded, but does not 
scale efficiently as D increases and is limited by the linear subspace assump- 
tion. Also making the linear subspace assumption, [23] proposed a Bayesian 
finite mixture model for sufficient dimension reduction. [22] instead propose a 
method for Bayesian nonparametric learning of an affine subspace motivated by 
classification problems. 

There is also a limited literature on Bayesian nonlinear dimensionality reduc- 
tion. Gaussian process latent variable models (GP-LVMs) [16] were introduced 
as a nonlinear alternative to PCA for visualization of high-dimensional data. 
[15] proposed a related approach that defines separate Gaussian process regres- 
sion models for the response and each predictor, with these models incorporating 
shared latent variables to induce dependence. The latent variables can be viewed 
as coordinates on a lower dimensional manifold, but daunting problems arise in 
attempting to learn the number of latent variables, the distribution of the latent 
variables, and the individual mapping functions while maintaining identifiabil- 
ity restrictions. [8] instead approximate the manifold through patching together 
hyperplanes. Such mixtures of linear subspace-based methods may require a 
large number of subspaces to obtain an accurate approximation even when d is 
small. 

It is clear that probabilistic models for learning the manifold face daunting 
statistical and computational hurdles. In this article, we take a very different 
approach in attempting to define a simple and computationally tractable model, 
which bypasses the need to estimate M. but can exploit the lower-dimensional 
manifold structure when it exists. In particular, our goal is to define an approach 
that obtains a minimax-optimal adaptive rate in estimating /, with the rate 
adaptive to the manifold and smoothness of the regression function. Surprisingly, 
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we show that this can be achieved with a simple Gaussian process prior. 

Section 2 defines the proposed model and gives some basic geometric back- 
ground along with a heuristic motivation for the model. Section 3 contains a 
more thorough background of necessary geometric concepts. Section 4 provides 
theory on adaptive rates. Section 5 contains simulation studies of finite sample 
performance relative to competitors, and Section 7 discusses the results. 

2. Gaussian Processes on Manifolds 
2.1. Background 

Gaussian processes (GP) are widely used as prior distributions for unknown 
functions due to tractable posterior computation and strong theoretical guaran- 
tees. For example, in the nonparametric regression (1.1), a GP can be specified 
as a prior for the unknown function /. In classification, the conditional distri- 
bution of the binary response Y\ is related to the predictor X{ through a known 
link function h and a regression function / as Yi\X{ ~ Ber \h{f(Xi)}\ , where 
/ is again given a GP prior. The following developments will mainly focus on 
the regression case. The GP with squared exponential covariance is a commonly 
used prior in the literature. The law of the centered squared exponential GP 
{W x : x G X} is entirely determined by its covariance function, 

K a (x,y) = EW x W y = exp(-a 2 ||x - y\\ 2 ), (2.1) 

where the predictor domain <Y is a subset of R^, 1 1 • 1 1 is the usual Euclidean norm 
and a is a length scale parameter. Although we focus on the squared exponential 
case, our results can be extended to a broader class of covariance functions 
with exponentially decaying spectral density, including standard choices such as 
Matern, with some elaboration. We use GP(m, K) to denote a GP with mean 
m : X — )• R and covariance K : X x X R. 

Given n independent observations, the minimax rate of estimating a D- 
variate function that is only known to be Holder s-smooth is n - s /( 2s + D ) [26]. 
A function in R D is said to be Holder s-smooth if it has bounded mixed partial 
derivatives up to order [s\ for [s\ the largest integer strictly smaller than s with 
the partial derivative of order [s\ being Lipschitz-continuous of order s — [s\. 
Surprisingly, [31] proved that, for Holder s-smooth functions, a prior specified 
as 

W A \A~GP(0,K A ) J A D ~Ga(a ,&o), ( 2 - 2 ) 

for Ga(ao,6o) ^ ne Gamma distribution with pdf p(t) oc t a °~ 1 e~ b ° t leads to the 
minimax rate n - s /( 2s + D ) U p to a logarithmic factor (logn)^ with /3 ~ D adap- 
tively over all s > without knowing s in advance. The superscript in W A 
indicates the dependence on A, which can be viewed as a scaling or inverse 
bandwidth parameter. Although the sample paths from this GP prior are al- 
most surely infinitely different iable, an intuitive explanation for such smoothness 
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Fig 1. In this data, 72 size 128 x 128 images were taken for a "lucky cat" from different 
angles: one at every 5 degrees of rotation. 36 images are displayed in this figure. 

adaptibility is that less regular or wiggly functions can be well approximated 
by shrinking the long path of a smooth function by a large factor a. 

In many real problems, the predictor X can be represented as a vector in high 
dimensional Euclidean space R D , where D is called the ambient dimensionality. 
Due to the curse of dimensionality, the minimax rate n - s /( 2s + D ) w [\\ deteriorate 
rapidly as D increases. This will become extremely fatal in the notorious small n 
large p problem, where D can be much larger than the sample size n. In such high 
dimensional situations, there is no hope to accurately estimate the regression 
function / without any assumption on the true model. One common assump- 
tion requires that / only depends on a small number d <C n of components of 
the vector X that are identified as important. In the GP prior framework, [25] 
proposed to use "spike and slab" type point mass mixture priors for different 
scaling parameters for each component of X to do Bayesian variable selection. 
[4] showed that carefully calibrated implementations of this approach can lead to 
minimax adaptive rates of posterior concentration. However, variable selection 
is a very restrictive notion of dimension reduction. Our focus is on a different 
notion, which is that the predictor lies on a manifold M. of intrinsic dimension d 
much lower than the ambient space dimension D. This manifold can be consid- 
ered as a d dimensional hyper surface in MP . A rigorous definition is described 
in section 3. A concrete example is shown in Fig.l. These data ([21]) consist of 
72 images of a "lucky cat" taken from different angles 5°, 10°, .... The predictor 
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X G R is obtained by vectorizing the 128 x 128 image. The response Y is a 
continuous function / of the rotation angle G [0, 27r] satisfying /(0) = /(27r), 
such as sin or cos functions. Intuitively, the predictor X concentrates on a circle 
in D = 128 2 -dim ambient space and thus the intrinsic dimension d of X is equal 
to one, the dimension of the rotation angle 6. 

2.2. Our Model and Rate Adaptivity 

When X G M. with M. d-dimensional, a natural question is whether we can 
achieve the intrinsic rate n - s /( 2s + d ) for / Holder s-smooth without estimating 
M. Surprisingly, the answer is affirmative. [33] showed that a least squares 
regularized algorithm with an appropriate d dependent regularization parameter 
can ensure a convergence rate at least n - s /( 8s + 4c 0(l O g n ) 2s /( 8s + 4c fo r functions 
with Holder smoothness s < 1. [5] proved that local polynomial regression with 
bandwidth dependent on d can attain the minimax rate n - s /( 2s + d ) fo r functions 
with Holder smoothness s < 2. However, similar adaptive properties have not 
been established for a Bayesian procedure. In this paper, we will prove that a GP 
prior on the regression function with a proper prior for the scaling parameter can 
lead to the minimax rate for functions with Holder smoothness s < {2,7 — 1}, 
where 7 is the smoothness of the manifold hA. In the remainder of this section, 
we first propose the model, and then provide a heuristic argument explaining 
the possibility of manifold adaptivity. Formal definitions and descriptions of 
important geometric concepts can be found in the next section. 

Analogous to (2.2), we propose the prior for the regression function / as 

W A \ A ~ GP(0, K A ), A d ~ Ga(a , 6 ), (2.3) 

where d is the intrinsic dimension of the manifold A4 and K a is defined as in 
(2.1) with || • || the Euclidean norm of the ambient space M D . Although the GP 
in (2.3) is specified through embedding in the R D ambient space, we essentially 
obtain a GP on M. if we view the covariance function K a as a bivariate function 
defined on A4 x A4. Moreover, this prior has two major differences with usual 
GPs or GP with Bayesian variable selection: 

1. Unlike GP with Bayesian variable selection, all predictors are used in the 
calculation of the covariance function K a \ 

2. The dimension D in the prior for inverse bandwidth A is replaced with 
the intrinsic dimension d. 

Generally, the intrinsic dimension d is unknown and needs to be estimated. 
Many estimation methods has been proposed [7, 6, 17, 18]. For example, [17] 
considered a likelihood based approach and [18] relies on singular value de- 
composition of local sample covariance matrix. We will use [17] to obtain an 
estimator d and then plug in this estimator into our prior (2.3) to obtain an 
empirical Bayes approach. 

In our model, we only need to estimate the intrinsic dimensionality d rather 
than the manifold A4. Most algorithms for learning J\A become computationally 
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demanding as the ambient space dimensionality D increases, while estimating 
d is fast even when D is tens of thousands. Moreover, although we use the full 
data in the calculation of the covariance function, computation is still fast for 
moderate sample sizes n regardless of the size of D since only pairwise Euclidean 
distances among .D-dimensional predictors are involved whose computational 
complexity scales linearly in D. This dimensionality scalability provides huge 
gains over two stage approaches (section 2.3) in high dimensional regression 
settings even though they can also achieve the optimal posterior convergence 
rate (Theorem 2.3). 

Intuitively, one would expect that geodesic distance should be used in the 
square exponential covariance function (2.1). However, there are two main ad- 
vantages of using Euclidean distance instead of geodesic distance. First, when 
geodesic distance is used, the covariance function may fail to be positive defi- 
nite. In contrast, with Euclidean distance in (2.1), K a is ensured to be positive 
definite. Second, for a given manifold A"f, the geodesic distance can be specified 
in many ways through different Riemannian metrics on A4 (section 3.1). Ac- 
cording to Lemma 3.6, all these geodesic distances are equivalent to each other 
and the Euclidean distance on R D . Therefore, by using the Euclidean distance, 
we bypass the need to estimate geodesic distance, but still reflect the geometric 
structure of the observed predictors in terms of pairwise distances. 

We provide heuristic explanations on why the rate can adapt to the predic- 
tor manifold through two observations. The first focuses on the possibility of 
obtaining an intrinsic rate for the regression problem (1.1) per se. Although 
the ambient space is R D , the support M. of the predictor X is a d dimension 
submanifold of R D . As a result, the GP prior specified in section 2.1 has all 
probability mass on the functions supported on this support, leading the pos- 
terior contraction rate to entirely depend on the evaluations of / on A4. More 
specifically, the posterior contraction rate is lower bounded by any sequence 
{e n : n > 1} such that 

n(d(/,/ ) >e n |X n ) ^0, n^oo, 

where II( A \ X n ) is the posterior probability of A and d 2 (f, Jo) = (V n )Xir=i {f( x i)~ 
fo(xi)) 2 under fixed design or d 2 (f,f ) = J M (f(x) - fo(x)) 2 G(dx) under ran- 
dom design, with G the marginal distribution for predictor X. Hence, /o) 
measures the discrepancy between / and the truth /o, and only depends on the 
evaluation of / on A4. Therefore, in a prediction perspective, we only need to 
fit and infer / on M . Intuitively, we can consider a special case when the points 
on manifold M have a global smooth representation x = (f)(i), where t G M d is 
the global latent coordinate of x. Then the regression function 

/(*) = =M*)> *eR d , ( 2 - 4 ) 

is essentially a d-variate s-smooth function if </) is sufficiently smooth. Then 
estimation of / on M D boils down to estimation of h on R d and the intrinsic 
rate would be attainable. For the general case, we can consider parameterizing a 
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Fig 2. Examples of one dimensional submanifolds in R 2 . 



compact manifold M by a finite number of local charts {(Ui, : i = 1, . . . , m} 
and obtain (2.4) for x in each local neighborhood [/j CM. However, since the 
parametrization \i in (2.4) is unknown or even does not exist, one possible goal 
is to develop methods that can adapt to low dimensional manifold structure. 

This motivates the second observation on the possibility of obtaining the 
intrinsic rate via the ambient space GP prior specified in (2.2). With this prior, 
the dependence among {f(xi)}f =1 is entirely characterized by the covariance 
matrix (K A (xi, Xj)) nxn , which depends on the pairwise Euclidean distance e 
among observed predictors {#i}2=i- Ideally, a distance (1m used in the covariance 
matrix should be an intrinsic distance, which measures the distance by traveling 
from one point to the other without leaving A4. More formally, an intrinsic 
distance is defined as the infimum of the length of all paths between two points. 
In the special case of (2.4), (1m{ x i x ') would be e[(j)~ 1 {x) 1 (j)~ 1 {x')) if <p is an 
isometric embedding from R d into R D . Fig. 2 also gives two simple examples 
where M. is a one dimensional submanifold in R 2 . Although B and C are close 
in Euclidean distance, they are far away in terms of intrinsic distance, which is 
the length of the arc from B to C. Fortunately, Lemma 3.6 in the next section 
suggests that for compact submanifolds, this bad phenomenon only occurs for 
remote points — d and (Im will become comparable as two points move close. 
Moreover, as two points A and B become closer, using d to approximate the 
intrinsic distance dM only introduces higher order error (see Proposition 3.5) 
proportional to the curvature of .M, which characterizes local distortion. In 
contrast, in the right plot is a straight segment in R 2 . In this case Euclidean 
distance always matches the intrinsic distance and whether the A4 itself is known 
would make no difference in predicting / since a straight segment is locally flat 
and has zero curvature. 

A typical nonparametric approach estimates f(x) by utilizing data at points 
near x, such as averaging over samples in a <5 n -ball around x, where the band- 
width S n decreases with sample size n. It is expected that as more observations 
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come in, properly shrinking 5 n could suppress both bias and variance, where the 
former is caused by local averaging and the latter is due to measurement error. 
This is only possible when / has certain smoothness such that large local fluctu- 
ations are not allowed. Therefore bandwidth tends to decrease at rate n -1 /( 2s+c 
depending on the smoothness level s of /. Since the scaling parameter a in the 
covariance function K a serves as an inverse bandwidth which would grow at 
rate n l /( 2s + d ) ? remote points tend to have exponentially decaying impact. As a 
result, one can imagine that accurate approximation of local intrinsic distance 
could provide good recovery of / as if we know the manifold and the associated 
intrinsic metric dj^. Note that for manifold .A4, the notion of "closeness" is 
characterized by the geodesic distances defined on A4. Often geodesic distances 
on J\A are not uniquely determined (section 3.1). Fortunately, Lemma 3.6 im- 
plies that for compact submanifolds, all distance metrics induced by Riemannian 
metrics on A4 are equivalent. Therefore we can choose any valid Riemannian 
metric as the base metric, which is the one induced by the ambient Euclidean 
metric in this paper. The following theorem is our main result which formalizes 
the above observations. 

Theorem 2.1. Assume that A4 is a d- dimensional compact C 1 submanifold 
of R D . For any fo G C S (M) with s < min{2,7 — 1} ; if we specify the prior 
as (2.2), then (4-1) will be satisfied for e n a multiple of n~ s ^ 2s+d \\ogn) Kl and 
e n a multiple of e n (\ogn) K2 with K\ = (1 + d)/(2 + d/s) and K2 = (1 + d)/2. 
This implies that the posterior contraction rate will be at least a multiple of 

n -s/(2s+d) Q Q g n )d+l _ 

The ambient space dimension D implicitly influences the rate via a multi- 
plicative constant. This theorem suggests that the Bayesian model (2.3) can 
adapt to both the low dimensional manifold structure of X and the smoothness 
s < 2 of the regression function. The reason the near optimal rate can only be 
allowed for functions with smoothness s < 2 is the order of error in approxi- 
mating the intrinsic distance dM by the Euclidean distance d (Proposition 3.5). 
Even if the intrinsic dimensionality d is misspecified as d! , the following theorem 
still ensures the rate to be much better than n~ 0<yl / D ^ when d! is not too small. 

Theorem 2.2. Assume the same conditions as in Theorem 2.1, but with the 
prior specified as (2.2) with d! ^ d and d! > d 2 / (2s + d). 

1. If d! > d, then the posterior contraction rate will be at least a multiple of 
n - s /(2 S +d / )(iog n )^ ; where k = (1 + d)/(2 + d'/s); 

2. If 2j+d < d' < d, then the posterior contraction rate will be at least a 

(2s + d)d / -d 2 

multiple of n H^+d)*' (\ gn) K , where k = (d + d 2 )/(2d' + dd' / s) + (1 + 
d)/2. 

2.3. Dimensionality Reduction and Diffeomorphism Invariance 

[27] and [24] initiated the area of manifold learning, which aims to design non- 
linear dimensionality reduction algorithms to map high dimensional data into 
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(M,g M ) — ( m >9m) 
1 d 

<£> |> 
(R D ,e) ^—^ (M J ,e) 

Fig 3. (Communicative) diagrams explaining the relationship between original ambient space 
and feature space. 

a low dimensional feature space under the assumption that data fall on an 
embedded non- linear manifold within the high dimensional ambient space. A 
combination of manifold learning and usual nonparametric regression leads to a 
two-stage approach, in which a dimensionality reduction map from the original 
ambient space R D to a feature space R d is estimated in the first stage and a 
nonparametric regression analysis with low dimensional features as predictors 
is conducted in the second stage. As a byproduct of Theorem 2.1, we provide a 
theoretical justification for this two stage approach under some mild conditions. 

Fig. 3 describes relationships used in formalizing this theory. The original 
predictor manifold M. sits in the ambient space R D . A Riemannian metric gM 
on M. is induced by the embedding map <£> and the Euclidean metric e on R D . \I/ : 
R D — >• R d is a dimensionality reduction map such that the restriction of \I/ 
on the embedding image $(M) c± M. is a diffeomorphism, which requires ^ m to 
be injective and both ^ m and its inverse to be smooth. The former requirement 
would imply d > d. Diffeomorphism is the least and only requirement such that 
both the intrinsic dimension d of predictor X and smoothness s of regression 
function / are invariant. ^ will naturally induce an embedding 

$ = M/o$:(X,5 A1 )^(K^e), (2.5) 

where the new Riemannian metric c/m is induced by the Euclidean metric e 
of R d . Finally Id is an identity map between the same set M. with different 
Riemannian metrics. Such a map ^ could also be chosen so that the induced 
embedding <l> satisfies some good properties, such as the equivariant embed- 
ding in shape analysis [13]. Due to the dimensionality reduction, the regression 
function becomes 

where / is a well defined function on the manifold M. represented in R d and 
has the same smoothness as /. Therefore, by specifying a GP prior (2.2) di- 
rectly on R d , we would be able to achieve a posterior contraction rate at least 
n - s /( 2s + d ) (logn) d+1 . The above heuristic can be formalized into the following 
theorem. 

Theorem 2.3. Assume that M is a d- dimensional compact C 1 submanifold 
of R D . Suppose that ^ : R D — >• R d is an ambient space mapping (dimension 



Y. Yang et al./Bayesian Manifold Regression 



10 



reduction) such that ^ restricted on Q(M) is a C 1 -diffeomorphism onto its 
image. Then by specifying the prior (2.2) with {^(Xi)}f =1 as observed predictors 
and Euclidean norm of R d as || • || in (2.1) ; for any fo G C S (M) with s < 
min{2,7 — 1,7' — 1}, (4-1) will be satisfied for e n = n~ s ^ 2s+d \\ogn) Kl and 
£~n = e n (logn) K2 with K\ = (1 + d)/(2 + d/s) and « 2 = (1 + d)/2. TTws implies 
that the posterior contraction rate will be at least e n = n _s ^ 2s+ ^(logn) d+1 . 

2.4. Measurement Error in the Predictors 

In applications, predictor Xi may not exactly lie on the manifold M.. We assume 
that Xi = X iQ + e^, where X i0 G M falls on the manifold and e$ ~ Nd(0 : 
are i.i.d measurement errors. In this case, choosing a linear projection map 
£ |rfxD ag dimensionality reduction ^ in the previous section can 
provide huge gain in terms of smoothing the data. As long as the elements of 
\I/ P do not have large variations, the central limit theorem ensures that the 
noise part ^ p e has order O p (L>~ 1/2 ), where e = (ei,...,e n ) G R Dxn . It is 
not straightforward to deterministically specify a linear projection \I/ P having 
good properties. Hence, we consider randomly generating \I/ P by sampling the 
elements i.i.d from a common distribution. The following multiplier central limit 
theorem [32, Lemma 2.9.5] provides support. 

Lemma 2.4. Let Z\,Z<i,.. be i.i.d. Euclidean random vectors with EZi = 
and E\\Zi\\ 2 < 00 independent of the i.i.d. sequence £i,£2j • ■ •> with E^i = and 
E£? = 1. Then conditionally on Zi, Z2, . . 

m 

y/m^]£iZj —> N(0, cov(Zi)) in distribution, 
j=i 

for almost every sequence Zi, Z2, . . .. 

For a fixed row ^ff = • • • , (id), its i.i.d components Qj can be viewed as 
£j in the lemma. Denote the rows of the noise matrix e by e^), . . . ,e( D y Viewing 
ey) as the Zj, by Lemma 2.4, we obtain that the new projected Ith predictor 

vector *f (Xi,...,X n ) T G M J has noise *fe = Ef=i = O p {D~ 1 ^). 

Therefore, the noise in the original predictors is reduced by random projection. 
The question is then whether the projected predictors can be included in a GP 
regression without sacrificing asymptotic performance relative to using X^o • The 
answer is the affirmative relying on Theorem 2.3 by the following argument. 

Theorem 2.3 only requires that \I/ P is a diffeomorphism when restricted on 
A4. Surprisingly, [2] (Theorem 3.1) proved more than this in the sense that 
for a compact d-dimensional Riemannian submanifold Ai of M D and a column 
normalized random projection \I/ P , if the projected dimension d is larger than 
0(d5~ 2 log(CDJ _1 ) log(p -1 )), where C is a positive constant depending on M., 
then with probability at least 1 — p, for every pair of points x, y G the 
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following holds 

where || • || is the Euclidean norm in M D or R d . This theorem implies that 




ty p preserve the ambient distances up to a scaling yd/D on the manifold by 
choosing 5 <C 1. In addition, this distance preservation property can also be 
extended to geodesic distances [2, Corollary 3.1]. Under the noised case, by 
normalizing the columns in \I/ P , the noise ^fe has order O p (D _1 ), which is of 
higher order compare to the scaling 0(D~ 1 ^ 2 ) in this theorem. Therefore, even 
if noise exists, a combination of the distance preservation property with the 
fact that ty p is a linear map implies that with large probability, \I/ P would be a 
diffeomorphism when restricted on M. Then Theorem 2.3 ensures that applying 
random projections in the first stage and plug in these projected predictors in a 
second state will not sacrifice anything asymptotically relative to using Xio in 
the GP. 



3. Geometric Properties 

We introduce some concepts and results in differential and Riemannian geome- 
try, which play an important role in the convergence rate. For detailed definitions 
and notations, the reader is referred to [9]. 



3.1. Riemannian Manifold 

A manifold is a topological space that locally resembles Euclidean space. A 
d- dimensional topological manifold M can be described using an atlas, where 
an atlas is defined as a collection {(U S1 (j) s )} such that M = \J S U S = and each 
chart <j) s : V — >• U s is a homeomorphism from an open subset V of ^-dimensional 
Euclidean space to an open subset U s of A4. By constructing an atlas whose 
transition functions {r s ^ = (f)^ 1 o cj) s } are C 1 differentiate, we can further 
introduce a differentiate structure on J\A. With this differentiate structure, 
we are able to define different iable functions and their smoothness level s < 7. 
Moreover, this additional structure allows us to extend Euclidean differential 
calculus to the manifold. To measure distances and angles on a manifold, the 
notion of Riemannian manifold is introduced. A Riemannian manifold (A4,g) 
is a differentiate manifold M in which each tangent space T P M is equipped 
with an inner product (•, -) p = g p (-,-) that varies smoothly in p. The family g p 
of inner products is called a Riemannian metric and is denoted by g. With this 
Riemannian metric g, a distance dM(p, q) between any two points p,q G M can 
be defined as the length of the shortest path on A4 connecting them. For a given 
manifold A4, such as the set P(n) of all n x n positive symmetric matrices [19, 
12], a Riemannian metric g is not uniquely determined and can be constructed 
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in various manners so that certain desirable properties, such as transformation 
or group action invariability, are valid. Although g is not uniquely determined, 
the smoothness of a given function / on M. only depends on A"Ts differential 
structure instead of its Riemannian metric. Therefore, to study functions on the 
manifold M, we could endow it with any valid Riemannian metric. Since a low 
dimensional manifold structure on the R^-valued predictor X is assumed in this 
paper, we will focus on the case in which M. is a submanifold of a Euclidean 
space. 

Definition 3.1. A4 is called a C 1 submanifold ofM D if there exists an inclusion 
map : M ^ R D , called embedding, such that $ is a diffeomorphism between 
M and <&(M) C R D , which means: 

(1) $ is injective and 7- differ 'entiable; 

(2) The inverse <l> _1 : §(M) — )• M. is also differentiate. 

A natural choice of the Riemannian metric g of M. is the one induced by the 
Euclidean metric e of R D through 

g p (u,v) = e^ p) (d<& p (u),d<& p (v)) = {d%(u), d$ p (v)) RD , Vu,v e T P M, 

for any p £ M. Under this Riemannian metric g, d<& p : T p A4 \-> dtf> p (T p M) C 
T^( p \R. D is an isometric embedding. Nash Embedding Theorem [20] implies that 
any valid Riemannian metric on M. could be considered as being induced from 
a Euclidean metric of R m with a sufficiently large m. Therefore, we would use 
this naturally induced g as the Riemannian metric of predictor manifold M. 
when studying the posterior contraction rate of our proposed GP prior defined 
on this manifold. Under such choice of g, M. is isometrically embedded in the 
ambient space R D . In addition, in the rest of this paper, we will occasionally 
identify Ai with <f>(Ai) when no confusion arises. 

Tangent spaces and Riemannian metric can be represented in terms of local 
parameterizations. Let (j) : U \-t M. be a chart that maps a neighborhood U of 
the origin in R d to a neighborhood (f>(U) of p e M. In the case that M. is a C 1 
submanifold of R D , (j) itself is 7-differentiable as a function from U G R d to R D . 
Given i G {1, . . . , d} and q = (j){u), where u = (ui, . . . , Ud) G U, define -^-(q) to 
be the linear functional on C 7 (A4) such that 

, V/gC^(A4), 

t=o 

where the d-dimensional vector ei has 1 in the i-th component and O's in others. 
Then -^~{q) can be viewed as a tangent vector in the tangent space T q Ai. 
Moreover, {-^-(q) : 1 < i < d} forms a basis of T q Ai so that each tangent 
vector v G T q Ki can written as 



d ( \( £\ d(fo(f)(u- 



tei)) 
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Under this basis, the tangent space of M. can be identified as R d and the matrix 
representation of differential d$ q at q has a (j, i)th element given by 



q \duiJ j . dt 



, i = l,...,d, j = l,...,D, 

t=o 



where we use the notation Fj to denote the jth component of a vector- valued 
function F. In addition, under the same basis, the Riemannian metric g q at q 
can be expressed as 

d 

g q (v,w)= ^2 v i w j gf j (u 1 ,...,u d ), 

i,3=l 

where (yi, . . . , v d ) and (wi, . . . , w d ) are the local coordinates for v, w G T q Ai. 
By the isometry assumption, 

d d 
gf j (u 1 ,...,u d ) = (d$ q ( — ) : d$ q (—)) RD . 

Riemannian volume measure (form) of a region R contained in a coordinate 
neighborhood 4>(U) is defined as 

Vol(R) = [ dV(q) 4 f Jdet(gUu))d Ul . . . du d . 
JR ^~ 1 (R) 

The volume of a general compact region R, which is not contained in a coordi- 
nate neighborhood, can be defined through partition of unity [9]. Vol generalizes 
the Lebesque measure of Euclidean spaces and can be used to define the integral 
of a function / G C(M) as f M f(q)dV(q). In the special case that / is supported 
on a coordinate neighborhood (j)(U), 

f(q)dV(q)= I f^(u))Jdet(gf j (u))du 1 ...du d . (3.1) 



3.2. Exponential Map 

Geodesic curves, generalizations of straight lines from Euclidean spaces to curved 
spaces, are defined as those curves whose tangent vectors remain parallel if they 
are transported and are locally the shortest path between points on the mani- 
fold. Formally, for p G M and v G T p M, the geodesic j{t,p,v),t > 0, starting 
at p with velocity v, i.e. j(0,p,v) = p and j'(t,p : v) = v, can be found as 
the unique solution of an ordinary differential equation. The exponential map 
E p : T P M. hA is defined by £ p (v) = 7(1, p,v) for any v G T P A4 and p G A4. 
Under this special local parameterization, calculations can be considerably sim- 
plified since quantities such as £ p 's differential and Riemannian metric would 
have simple forms. 



Y. Yang et al./Bayesian Manifold Regression 



14 



Although Hopf-Rinow theorem ensures that for compact manifolds the expo- 
nential map Sp at any point p can be defined on the entire tangent space T P A4, 
generally this map is no longer a global diffeomorphism. Therefore to ensure 
good properties of this exponential map, the notion of a normal neighborhood 
is introduced as follows. 

Definition 3.2. A neighborhood V of p G M is called normal if: 

(1) Every point q G V can be joined to p by a unique geodesic 7(£,p, f),0 < 
t < 1, with 7(0, v) = p and 7(1, p, v) = q; 

(2) £ p is a diffeomorphism between V and a neighborhood of the origin in 
T P M. 

Proposition 2.7 and 3.6 in [9] ensure that every point in M has a normal 
neighborhood. However, if we want to study some properties that hold uniformly 
for all exponential maps £ q with q in a small neighborhood of p, we need a notion 
stronger than normal neighborhood, whose existence has been established in 
Theorem 3.7 in [9]. 

Definition 3.3. A neighborhood W of p G M is called uniformly normal if 
there exists some 5 > such that: 

(1) For every q G W, £ p is defined on the S-ball B$(0) C T q A4 around the 
origin ofT q A4. Moreover, £ p (Bs(0)) is a normal neighborhood of q; 

(2) W C £ p (Bs(0)), which implies that W is a normal neighborhood of all its 
points. 

Moreover, as pointed out by [11] and [33], by shrinking W and reducing 5 at 
the same time, a special uniformly normal neighborhood can be chosen. 

Proposition 3.4. For every p G M there exists a neighborhood W such that: 

(1) W is a uniformly normal neighborhood of p with some S > 0; 

(2) The closure of W is contained in a strongly convex neighborhood U of p; 

(3) The function F(q,v) = (q,£ q (v)) is a diffeomorphism from Ws = W x 
Bs(0) onto its image in M xM. Moreover, \dF\ is bounded away from 
zero on Ws- 

Here U is strongly convex if for every two points in U , the minimizing geodesic 
joining them also lies in U . 

Throughout the rest of the paper, we will assume that the uniformly nor- 
mal neighborhoods also possess the properties in the above proposition. Given 
a point p G we choose a uniformly normal neighborhood W of p. Let 
{ei,...,e<i} be an orthonormal basis of T P A4. For each q G W, we can de- 
fine a set of tangent vectors {ef,...,e^} C T q M by parallel transport [9]: 
Ci G TpM. H> ej^ G T 7 ( £ ).A4 from p to q along the unique minimizing geodesic 
j(t) (0 < t < 1) with 7(0) = p, 7(1) = q. Since parallel transport preserves 
the inner product in the sense that g 1 ^{v 1< ^ t \w 1 ^) = g p (v,w),\/v,w G T p Ai, 
{el , . . . , e^} forms an orthonormal basis of T q Ai. In addition, the orthonormal 
frame defined in this way is unique and depends smoothly on q. Therefore, we 
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obtain on W a system of normal coordinates at each q G W, which parameterizes 
xe£ q {B s (0)) by 



^ i=i ' 



0«(tzi, . . . , u d ), u = (ui, ...,iid)e B s (0). (3.2) 



Such coordinates are called g-normal coordinates. The basis of T q M associated 
with this coordinate chart (Bs(0),(/) q ) is given by 



d -(q)(f) = d ^ o£ ^4)) _ d(/o 7 (t,g,e?)) 



dui dt 



t=o dt 



= e?(/), i = l,...,d. 

t=0 



Therefore {-^-(q) = e q \ 1 < i < d] forms an orthonormal basis on T q Ai. 
By Proposition 3.4, for each x G £ q (Bs(0)), there exists a minimizing geodesic 
7(t, q, v),0 < t < 1, such that 7(0, g,v) = q,J f (0,q,v) = v and 7(1,(7,1;) = x, 
where v = S~ 1 {x) = Yli=i u i e i ^ ^Al Hence d M (q,x) = J* \Y (£, q,v)\dt = 



\v\ = Nidi, i.e. 



where || • || is the Euclidean norm on R d . The components g^j(u) of the Rieman- 
nian metric in g-normal coordinates satisfy g^(0) = g q {e q ^e q ) = Sij. The fol- 
lowing results [11, Proposition 2.2] provide local expansions for the Riemannian 

metric g^u), the Jacobian yj det(gfj(u)) and the distance dj^iQ: Yli=i u i e i) i n 
a neighborhood of p. 

Proposition 3.5. Let M be a submanifold of R D which is isometrically em- 
bedded. Given a point p G M, let W and 5 be as in Proposition 3.4, and con- 
sider for each q G W the q-normal coordinates defined above. Suppose that 
* = Eti ^ e S q (B s (0)). Then: 

(1) The components g^j(u) of the metric tensor in q-normal coordinates admit 
the following expansion, uniformly in q G W and x G £ q (Bs(0)): 



1 d 

g q j (u 1 ,...,u d ) = 5 ij -- Ri sj (0)u r u s + O(d 3 M (q,x)), (3.4) 

r,s=l 

where R q rs j(0) are the components of the curvature tensor at q in q-normal 
coordinates. 

(2) The Jacobian 1 det{g q -){u) in q-normal coordinates has the following ex- 
pansion, uniformly in q G W and x G £ q (Bs(0)): 

1 d 

det(g? j )(u 1 ,...,u d ) = l-- R< a (0)u r u 8 + 0(d 3 M (q,x)), (3.5) 

r,s=l 

where Ric q s (0) are the components of the Ricci tensor at q in q-normal 
coordinates. 
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(3) There exists C p < oo such that 



< d 2 M (q,x) -\\q-x\\ 2 < C p d A M (q,x) 



(3.6) 



holds uniformly in q £ W and x £ £ q (B $(())). 

Note that in Proposition 3.5, (3) only provides a comparison of geodesic 
distance and Euclidean distance in local neighborhoods. Under a stronger com- 
pactness assumption on A4, the following lemma offers a global comparison of 
these two distances. 

Lemma 3.6. Let M. be a connected compact submanifold of W D with a Rie- 
mannian metric g that is not necessarily induced from the Euclidean metric. 
Then there exist positive constants C\ and C2 dependent on g, such that 



where 1 1 • 1 1 is the Euclidean distance in R D . Moreover, if M. is further assumed 
to be isometrically embedded, i.e. g is induced from the Euclidean metric ofW D , 
then C\ could be chosen to be one and C2 > 1. 

Proof. We only prove the first half of the inequality since the second half follows 
by a similar argument and is omitted here. Assume in the contrary that for any 
positive integer fc, there exists (xk,yk) such that \\xk — Vk\\ > kdM(%k,yk)- 
Let <E> : M — » R D be the embedding. Since M is compact, {xk} and {yk} 
have convergent subsequences, whose notations are abused as {xk} and {yk} 
for simplicity. Denote the limits of these two sequences as xo and yo. By the 
compactness of A4 and continuity of we know that $(M) is also compact 
and therefore dM&ktVk) —> 0, as k —> 00. This implies that xo = yo = p. 

For each j £ {1, . . . the jth component ; : A4 — »• R of is differentiable. 
Let S p and W p be the 5 and W specified in Proposition 3.4. Define f(q,v) = 
^{ii2{F{q 1 v))) = <&(£ p (v)), where 7T2 is the projection of M. x J\A on to its second 
component. By Proposition 3.4, / is differentiable on the compact set Ws p and 
therefore for each I £ {1, . . . , d}, ^ is uniformly bounded on Ws p . This implies 
that for some constant C > 0, \\x — y\\ = \\f(y, £~ l (x)) — f(y^£~ 1 (y))\\ < 
C\\£~ 1 (x) -£~ 1 (y)\\ = CdM(x,y) for all x,y e W p with d,M(x,y) < S p . Since 
Xk — » p and yk — » p, there exists an integer fco such that for all k > fco? %k-> 
yk £ W p and d M (xk,yk) < S p . Therefore \\x k - yk\\ < Cd M (xk,yk), which 
contradict our assumption that \\xk — yk\\ > kdM{ x k-,yk) fc> r all k. 

Consider the case when <l> is an isometric embedding. For any points x, y £ 
.M, we can cover the compact geodesic path l XyV from x to y by {W^ : i = 
1, . . . , n} associated with a finite number of points {pi, . . . ,p n } C M.. Therefore 
we can divide l XjV into U™ =1 l(x s -i, x s) such that xq = x, x n = y, and each 
segment l(x s -i,x s ) lies in one of the W Pi 's. By Proposition 3.5(3), for each 
s £ {1, . . . ,ra}, dA4(x s _i,x s ) > ||x s _i - x s ||. Therefore, 



CiHa; - < d M (x,y) < C 2 \\x - y\\, Vx,y £ M 



(3.7) 




where the last step follows from the triangle inequality. 



□ 
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The above lemma also implies that geodesic distances induced by different 
Riemannian metrics on M. are equivalent to each other. 

Fix p G M and let W and S > be specified as in Proposition 3.4. Since M. 
is a submanifold of R D , for any point q G M, the exponential map £ q : B$(0) —> 
M C R D is a different iable function between two subsets of Euclidean spaces. 
Here, we can choose any orthonormal basis of T q M. since the representations of 
£ q under different orthonormal bases are the same up to d x d rotation matrices. 
Under the compactness assumption on A4, the following lemma, which will be 
applied in the proof of lemma 4.4, ensures the existence of a bound on the partial 
derivatives of £ q s components {£ q ,i : i = 1, . . . , D} uniformly for all q in the S 
neighborhood of p: 

Lemma 3.7. Let M be a connected C 1 compact submanifold of R D with 7 
being 00 or any integer greater than two. Let k be an integer such that k < 7. 
Then: 

1. There exists a universal positive number So, such that for every p G M, 

proposition 3.4 is satisfied with some S > So and W p ; 

2. With this So, for any p G M, mixed partial derivatives with order less than 

or equal to k of each component of £ p are bounded in Bs (0) G T p M by a 
universal constant C > 0. 

Proof. Note that M. = U p gM^(^'^p)' wnere $p an d W(p,5 p ) are the corre- 
sponding p dependent S and open neighborhood W in proposition 3.4. By the 
compactness of we can choose a finite covering {W(pi, S Pl ), . . . , W(p ni S Pn )}. 
Let So = mm{S Pl , . . . , S Pn }. Then the first condition is satisfied with this So since 
for any p G M, W p could be chosen as any W(pj,5 Pj ) that contains p. 

Next we prove the second condition. For each j, we can define q- normal coor- 
dinates on W(pj,5 Pj ) as before such that the exponential map at each point q G 
W(pj,5 Pj ) can be parameterized as (3.2). Define Fj : W(pj, S Pj ) x B$ (0) — » R D 

by Fj(q,u) = £ q (Eti^) = 4> q {v). Then any order k mixed partial deriva- 

d k (f) q 

tive du . 1 ...Q u . k (u) of Fj(q, u) with respect to u is continuous on the compact set 
W(pj, S Pj ) x Bs (0). Therefore these partial derivatives are bounded uniformly 
in q G W(pj,5 Pj ) and u G Bs p .(0). Since M is covered by a finite number of 
sets {W(pi, S Pl ), . . . , W{p ni S Pn )}, the second conclusion is also true. □ 

By lemma 3.7, when a compact submanifold M. has smoothness level greater 
than or equal to fc, we can approximate the exponential map £ p : Bs o (0) C 
R d R D at any point p G M. by a local Taylor polynomial of order k with 
error bound C£q, where C is a universal constant that only depends on k and 
M. 

4. Posterior Contraction Rate of the GP on Manifold 

In the GP prior (2.3), the covariance function K a : J\A x J\A R is essen- 
tially defined on the submanifold A4. Therefore, (2.3) actually defines a GP 
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on M and we can study its posterior contraction rate as a prior for functions 
on the manifold. In this section, we combine geometry properties and Bayesian 
nonparametric asymptotic theory to prove the theorems in section 2. 

j^.l. Reproducing Kernel Hilbert Space on the Manifold 

Being viewed as a covariance function defined on [0, 1] D x [0, 1] D , K a (-, •) cor- 
responds to a reproducing kernel Hilbert space (RKHS) HP, which is defined 
as the completion of the linear space of all functions on [0, 1] D with the 
following form 

m 

x \-+ ^2 ctiK a (xi,x), x G [0, 1]"°, 
i=i 

indexed by ai, . . . , a m G R and xi,..., x m G [0, m G IN, relative to the 
norm induced by the inner product defined through (K a (x, •), K a (y, -))h« = 
K a (x, y). Similarly, K a (-, •) can also be viewed as a convariance function defined 
on AixAi, with the associated RKHS denoted by HP. Here HP is the completion 
of which is the linear space of all functions on J\A with the following form 

m 

x i — y ^2 ciiK a (xi,x), x G M, 

indexed by a±, . . . , a m G R and x±, . . . , x m G M, m G IN. 

Many probabilistic properties of GPs are closely related to the RKHS asso- 
ciated with its covariance function. Readers can refer to [1] and [30] for intro- 
ductions on RKHS theory for GPs on Euclidean spaces. In order to generalize 
RKHS properties in Euclidean spaces to submanifolds, we need a link to trans- 
fer the theory. The next lemma achieves this by characterizing the relationship 
between HP and HP. 

Lemma 4.1. For any f G HP ; there exists g G HP such that g\M = / and 
\\g\W a — 1 1 /Ilea, where g\M is the restriction of g on A4. Moreover, for any 
other g' G HP with g'\ M = f, we have \ \g'\\m° > I l/llia, which implies \ \f\\^ a = 
infgem a i9 \ M =f \\g\W°. 

Proof Consider the map $ : ti —> % that maps the function 

m 

ai, . . . , a m G R, xi, . . . , x m G M.,m G IN 

i=i 

on M. to the function of the same form 

771 

i=l 

but viewed as a function on [0,1]^. By definitions of RKHS norms, $ is an 
isometry between H and a linear subspace of %. As a result, $ can be extended 



Y. Yang et al./Bayesian Manifold Regression 



19 



to an isometry between HP and a complete subspace of HP. To prove the first 
part of this lemma, it suffices to justify that for any / G HP, g = <E>(/)|x = /. 
Assume that the sequence {f n } £ Ti satisfies 

Wfn ~ /Ilia -> 0, as n -> oo, 

then by the definition of on ti, $(/ n )|.M = /n- For any x G [0, 1] D , by the 
reproducing property and Cauchy-Schwarz inequality, 

\*(f n )(x) - g(x)\ = \{ K a (x r )M fn)-g)wA 

< y/K"(x,x) ||*(/n)-*(/)||H- 

||/n - f\\ua -> 0, as n oo, 

where the last step is by isometry. This indicates that g can be obtained as a 
point limit of 3>(/ n ) on [0,1]^ and in the special case when x G M, 

g(x) = lim $(f n )(x) = lim / n (x) = f(x). 

n— )>oo n^oo 

Denote the orthogonal complement of <£(HP) in HP as <&(H. a ) ± . Since (g r — 
9)\m = 7 which means (K a (x,-),g — g')w a = for all x G M. Therefore by 
the previous construction, g — g' _L $(HI a ), i.e. g f — g e ^(HI )^ and using 
Pythagorean theorem, we have 

□ 

This lemma implies that any element / in the RKHS HP could be considered 
as the restriction of some element g in the RKHS HP. Particularly, there exists a 
unique such element g in HP such that the norm is preserved, i.e. \ \g\\u a = 1 1/| |§a • 

4-2. Background on Posterior Convergence Rate for GP 

As shown in [10], in order to characterize the posterior contraction rate in 
a Bayesian nonparametric problem, such as density estimation, fixed/random 
design regression or classification, we need to verify some conditions on the 
prior measure II. Specifically, we describe the sufficient conditions for randomly 
rescaled GP prior as (2.2) given in [31]. Let X be the predictor space and fo 
be the true function fo : X IR, which is the log density \ogp(x) in density 
estimation, regression function J5[y|X] in regression and logistic transformed 
conditional probability logitP(F = 1\X) in classification. We will not consider 
density estimation since to specify the density by log density /o, we need to 
know the support M so that can be normalized to produce a valid density. 
Let e n and e n be two sequences. If there exist Borel measurable subsets B n of 
C(X) and constant K > such that for n sufficiently large, 

P(\\W A - foWoo <e n ) >e-" e «, 

P(W A iB n ) < e - 4ne -, (4.1) 
log7V(e n ,S n ,|| • I |oo) <ne 2 n , 
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where W A ~ II and || • ||oo is the sup-norm on C{X), then the posterior con- 
traction rate would be at least e n V e n . In our case, X is the d-dimensioanl 
submanifold M. in the ambient space R D . To verify the first concentration con- 
dition, we need to give upper bounds to the so-called concentration function 
[31] 0j o (e) of the GP W a around truth fo for given a and e. </>j (e) is com- 
posed of two terms. Both terms depend on the RKHS HP associated with the 
covariance function of the GP W a . The first term is the decentering function 
inf { 1 1 h 1 1 j| a : \ \h — /o 1 1 oo < e }? where 1 1 • 1 1 § a is the RKHS norm. This quantity mea- 
sures how well the truth fo could be approximated by the elements in the RKHS. 
The second term is the negative log small ball probability — log 1 1 W a 1 1 oo < e), 
which depends on the covering entropy log iV(e n , HJ, || • ||oo) of the unit ball in 
the RKHS HP. As a result of this dependence, by applying Borell's inequality 
[30] , the second and third conditions can often be proved as byproducts by using 
the conclusion on the small ball probability . 

As pointed out by [31], the key to ensure the adaptability of the GP prior 
on Euclidean spaces is a sub-exponential type tail of its stationary covariance 
function's spectral density, which is true for squared exponential and Matern 
class covariance functions. More specifically, a squared exponential covariance 
function K\(x, y) = exp { — \ \x — ?/| | 2 /2} on W D has a spectral representation as 



where \i is its spectral measure with a sub-Gaussian tail, which is lighter than 
sub-exponential tail in the sense that: for any S > 0, 



For convenience, we will focus on squared exponential covariance function, 
since generalizations to other covariance functions with sub-exponential decay- 
ing spectral densities are possible with more elaboration. 

4-3. Decentering Function 

To estimate the decentering function, the key step is to construct a function 
I a (f) on the manifold A4 to approximate a differentiate function /, so that 
the RKHS norm ||ia(/)| Infra can be tightly upper bounded. Unlike in Euclidean 
spaces where functions in the RKHS HP can be represented via Fourier trans- 
formations [31], there is no general way to represent and calculate RKHS norms 
of functions in the RKHS HP on manifold. Therefore in the next lemma, we pro- 
vide a direct way to construct the approximation function I a (f) for any truth 
/ via convolving / with K a on manifold A4: 





(4.2) 
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where V is the Riemannian volume form of M.. Heuristically, for large a, the 
above integrand only has non-negligible value in a small neighborhood around 
x. Therefore we can conduct a change of variable in the above integral with 
transformation (j) x : B§ —> W defined by (3.2) in a small neighborhood W of x: 



where the above approximation holds since: 1. (j) x (0) = x\ 2. (jf preserve lo- 



cal distances (Proposition 3.5 (3)); 3. the Jacobian y det(gfj(u)) is close to one 
(Proposition 3.5 (2)). From this heuristic argument, we can see that the approx- 
imation error \ \I a (w) — /o||oo is determined by two factors: the convolution error 



| ( — 7=) d f Rd exp { — a ^ } / (cjf (n)) du — f{x)\ and the non-flat error caused 



by the nonzero curvature of M.. Moreover, we can expand each of these errors 
as a polynomial oil/ a and call the expansion term related to 1 / a k as kih order 
error. 

When M. is Euclidean space R d , the non-flat error is zero, and by Taylor 
expansion the convolution error has order s if fo G C s (M d ) and s < 2, where 
C s (M d ) is the Holder class of s-smooth functions on R d . This is because the 
Gaussian kernel exp{ — \\(x — y)\\ 2 /2} has a vanishing moment up to first order: 
J xexp(— \\(x — y)\\ 2 /2)dx = 0. Generally, the convolution error could have 
order up to s + 1 if the convolution kernel K has vanishing moments up to 
order s, i.e. J x t K(x)dx = 0,t = However, for general manifold M. 

with no n- vanishing curvature tensor, the non-flat error always has order two 
(see the proof of Lemma 4.2). This implies that even though carefully chosen 
kernels for the covariance function can improve the convolution error to have 
order higher than two, the overall approximation still tends to have second order 
error due to the deterioration caused by the nonzero curvature of the manifold. 
The following lemma formalizes the above heuristic argument on the order of 
the approximation error by (4.3) and further provides an upper bound on the 
decentering function. 

Lemma 4.2. Assume that A4 is a d- dimensional compact C 1 submanifold of 
R D . Let C S (M) be the set of all functions on M with holder smoothness s. 
Then for any f G C S (M) with s < min{2,7} ; there exist constants ao > 1, 
C > and B > depending only on n, M and f such that for all a> ao, 




/(^(u))^det(4(u))du, 



/(^(0)) =/(*), xeM, 



inf{||/i||| : sup \h(x) - f(x)\ < Ca~ s } < Ba d . 



xeM 



Proof. The proof consists of two parts. In the first part, we prove that the ap- 
proximation error of I a (f) can be decomposed into four terms. The first term T± 
is the convolution error defined in our previous heuristic argument. The second 
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term T2 is caused by localization of the integration, which is negligible due to the 
exponential decaying of the squared exponential covariance function. The third 
and fourth terms T3, T4 correspond to the non-flat error, with T3 caused by ap- 
proximating the geodesic distance with Euclidean distance 1 1 1 (j) q (u) — q \ | 2 — 1 1 u \ | 2 1 , 

and T4 by approximating the Jacobian | \J dei(gf-(u)) — 1 1 . Therefore the overall 
approximation error \I a (f)(x) — f(x)\ has order s in the sense that for some 
constant C > dependent on M and /: 

sup \I a (f)(x) - f(x)\ < Ca~ s , s < min{2, 7 }. (4.4) 

In the second part, we prove that I a (f) belongs to HP and has a squared RKHS 
norm: 

\\I a (f)\\l a <Ba d , 

where B is a positive constant not dependent on a. 

Step 1 (Estimation of the approximation error): This part follows similar 
ideas as in the proof of Theorem 1 in [33], where they have shown that (4.4) 
holds for s < 1. Our proof generalizes their results to s < 2 and therefore needs 
more careful estimations. 

By Proposition 3.5, for each p e M, there exists a neighborhood W p and 
an associated S p satisfying the two conditions in Proposition 3.4 and equations 
(3.4)- (3.6). By compactness, M. can be covered by U pe j>W p for a finite subset 
V of M. Then sup xeM \I a (f)(x) - f(x)\ = sup peV {swp xeWp \I a (f)(x) - f(x)\}. 
Let (5* = min pG -p{min{(5p, 1/ \/2C p }} > 0, where C p is defined as in equation 
(3.6). Choose ao > 1 sufficiently large such that Co \J (2d + 8) log ao / ao < 
where Co is the C2 in Lemma 3.6. 

Let q eW p and a > a . Define B% = {x e M : oIm(q, x) < Co\/ (2d + 8) log a/a}. 
Combining equation (3.3) and the fact that £ q is a diffeomorphism on ^*(0), 

d 
i=l 

where B a = {u : \ \u\ \ < C ^(2d + 8) log a/a} C B s *(0). 

Denote <jfl\u) = £ q (T,i=i u i e i)- Then B l = <t> q (Ba). By definition (3.1), 

(-^ d J Bq K*(x,y)f(y)dV(y) 

= (^) d f §a exp { - a2 ' k - f }fmu))^^)(u)du. 

Therefore, by (4.3) we have the following decomposition: 
W){q) ~ f(q) = T 1+ T 2 +T 3 + T 4 , 
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where 



}[/(^(«))-/(^(0))]< 
T 2 =(-^) d J^K a ( q ,y)f(y)dV(y)-(-^=y exp { - } f(q)du, 



o 1 1 1 1 2 

a A \\uy z 



a V f f a 2 ||g-^(u)|| 2 



))du, 



T 4 =^- 7 =j ^ expj- ^ y^(u))^det(g^)(u)-l)du. 

Step 1.1 (Estimation ofTj: Let g = fo^. Since / G C S (.M) and (</>«, B S *(Q)) 
is a C 7 coordinate chart, we have g G C s (IR d ) and therefore 

f R(u,s), if < s < min{l,7}, 

g W ^-^Y^^^Ui + R^s), ifl<s<min{2, 7 }, 

where the remainder term \R(u, s)\ < Ci\\u\\ s for all < s < min{2,7}. Since 
B a is symmetric, 

f f « 2 |I^H 2 1 , „ • , 
y exp | — 1^ = 0, z = l,...,d, 

and therefore 

|Ti|<Ci(-£=y / expl-^M^lll^l| S ^ = ^2a- s . 



5£ep (Estimation of T2): Denote T2 = S\ + 6*2 where 5i and S2 are the 
first term and second term of T2, respectively. By Lemma 3.6, for y G .A4\I^, 
Ik — > dM(q, y)/Co > \j (2d + 8) log a/a. Therefore, 



|5ii = 



— V / ex [ a2 llg-^H 2 ] 



3 Vol(M) 
= C 3 a" 4 < C 3 a- S . 
As for 52, we have 



(2d + 8) log a 



|S 2 |<||/||J^=) / exp{ 

VV^TT/ ^||n||>CoA/(2d+8) log a/a I ^ J 



/27T/ J\\u\\>C y/(2d+8) log a/a 



<II/|U^=) / exp ;- ^ + 8)lo g a - x 
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since d > 1, Co > 1 and a > ao > 1. 

Combining the above inequalities for Si and $2, we obtain 

|T 2 |<(C3 + C 4 )a- s =C 5 a- s . 

Step 1.3 (Estimation ofTs): By equation (3.6) in Proposition 3.5 and equa- 
tion (3.3), we have 

|IM| 2 -||<Z-^(«)|| 2 | = \d 2 M (q,r(u))-\\q-^(u)\\ 2 \ < C p d 4 M (<l, <l> q = C p \\u\\ 4 . 

(4.5) 

Therefore by using the inequality | e a — e b \ < \a — b\ max{e a ,e 6 }fora,6>0, 
we have 

1-1^11,11 ( a Y [ J / a 2 \\q-<t> q {u)\\ 2 \ f a 2 \\u\\ 2 \\a 
\T 3 \ < H/Hoo [-^) Imaxiexpi },exp 



Q 2 IMI 2 j 



2 



'2?T/ J Ba I I * 

By equation (4.5) and the fact that u G ||^|| 2 < (J*) 2 < 1/(2C P ) and hence 
||H| 2 -||g-0«(«)|| 2 | < \\\uf, ||g-^(«)|| 2 >^|H| 2 . (4.6) 

Therefore 

|T 3 |<||/|u(-^y / e X pi-^^\^^du = C 6 a- 2 <C 6 a- s , 



since a > ao > 1. 

S^ep i.^ (Estimation ofT^): By equation (3.5) in Proposition 3.5, there exists 
a constant Cj depending on the Ricci tensor of the manifold .M, such that 



|^det(4)( U )-l|<C 7 |H| 2 . 
Therefore, by applying equation (4.6) again, we obtain 

, d r ( „2||„,||2 

/2tt 

Combining the above estimates for 7\, T2, T3 and T4, we have 



r 4 | < C 4 ||/||oo( ) / exp<i - - """ \\\u\\ 2 du = C 8 a- 2 < C 8 a 



sup \I a (f)(q)(x) - f(q)(x)\ < (C 2 + C 3 + C 6 + C 8 )a" s = Ca~ s . 
xeM 

Step 2 (Estimation of the RKHSnorm): Since (K a (x, -),K a (y, -)) ta = K a (x,y), 
we have 

2d 

a 



IM/)llfi- = ( / / K a (x,y)f(x)f(y)dV(x)dV(y) 
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Applying the results of the first part to function / = 1, we have 



[ K a (x,y)dV(y)-l ■ 
/2tt/ Jm 

since a > ao > 1. Therefore, 



< Car' < C, 



< (1 + C)\\f\\l( ) Vol(M) = Ba d . 



□ 



4-4- Centered Small Ball Probability 



As indicated by the proof of Lemma 4.6 in [31], to obtain an upper bound 
on — logPdll^Hoo < e), we need to provide an upper bound for the covering 
entropy log N(e, Mf , 1 1 • | |oo) of the unit ball in the RKHS HP on the submanifold 
Ai. Following the discussion in section 4.1, we want to link HP to HP, the 
associated RKHS defined on the ambient space M. D . Therefore, we need a lemma 
to characterize the space HP [31, Lemma 4.1]. 

Lemma 4.3. HP is the set of real parts of the functions 



x i y 



when i/j runs through the complex Hilbert space Z/2(/i a )- Moreover, the RKHS 
norm of the above function is 1 1 I U 2 w here fi a is the spectral measure of the 
covariance function K a . 

Based on this representation of HP on R D , [31] proved an upper bound 
Ka D { log ~) +1 for log iV(e, HJ, || • ||oo) through constructing an e-covering set 
composed of piecewise polynomials. However, there is no straightforward gen- 
eralization of their scheme from Euclidean spaces to manifolds. The following 
lemma provides an upper bound for the covering entropy of MJ, where the D in 
the upper bounds for HIJ is reduced to d. The main novelty in our proof is the 
construction of an e-covering set composed of piecewise transformed polynomials 
(4.12) via analytically extending the truncated Taylor polynomial approxima- 
tions (4.9) of the elements in M". As the proof indicates, the d in a d relates to 
the covering dimension d of A^, i.e. the e-covering number N(e, M,e) of M is 
proportional to l/e d . The d in (log relates to the order of the number k d 

of coefficients in piecewise transformed polynomials of degree k in d variables. 

Lemma 4.4. Assume that M is a d-dimensional C 1 compact submanifold of 
R D with 7 > 2. Then for squared exponential covariance function K a , there 
exists a constant K depending only on d, D and M, such that for e < 1/2 and 
a > maxjao, e _1 ^ 7_1 ^} ; where So is defined in Lemma 3.7 and ao is a universal 
constant, 



\ogN(e^\\'\\ 00 )<Ka d hog-j 



Y. Yang et al./Bayesian Manifold Regression 



26 



Proof. By Lemma 4.1 and Lemma 4.3, a typical element of HP can be written 
as the real part of the function 

h^(x) = [ e i ^ x) ^(\)p a (d\), for x G M 



for ip : R D — > C a function with j \ifj\ 2 p a (d\) < 1. This function can be ex- 
tended to R D by allowing x G R D . For any given point p G .M, by (4.3), we 
have a local coordinate <j) p : i?<5 (0) C R d — >• R D induced by the exponential 
map £ p . Therefore, for x G P (-B^ O (O)), h^(x) can be written in local g-normal 
coordinates as 

VpM = h*{p(u)) = J e^ P ^^(X)p a (d\), u G B* o (0). (4.7) 

Similar to the idea in the proof of Lemma 4.5 in [31], we want to extend the 
function h^^ p to an analytical function z \-> J e % ^ x, ^ P ^ ijj(X) p a (d\) on the set 
Q = {z G C d : ||Rez|| < #o, l|I m2; ll < p/ a } f° r some p > 0. Then we can obtain 
upper bounds on the mixed partial derivatives of the analytic extension h^ iP 
via Cauchy formula, and finally construct an e-covering set of by piecewise 
polynomials defined on Ai. Unfortunately, this analytical extension is impossible 
unless (j) p (u) is a polynomial. This motivates us to approximate (j) p (u) by its 
7th order Taylor polynomial P pn (u). More specifically, by Lemma 4.7 and the 
discussion after Lemma 3.7, the error caused by approximating (f) p (u) by P pn (u) 
is 

M^W) -M P ivrM)| <a\\<IP(u)-P vn (u)\\ <Ca\\u\p. (4.8) 

For notation simplicity, fix p as a center and denote the function h^(P pn (u)) 
by r(u) for u G B$ . Since P pn (u) is a polynomial of degree 7, view the function 
r as a function of argument u ranging over the product of the imaginary axes 
in C d , we can extend 

r(u) = I e« x > p *«M^(\)fjL a (d\), u G B 6o (0) (4.9) 



to an analytical function z \-> j e^ A,jPp ^^ijj{\)pL a {d\) on the set Q = {z G C d : 
||Rez|| < So, ||Imz|| < p/a} for some p > sufficiently small determined by the 
S < 1/2 in (4.2). Moreover, by Cauchy- Schwarz inequality, \r(z)\ < C for z G ft 
and C 2 = J e 6 ^ x ^ p(dX). Therefore, by Cauchy formula, with D n denoting the 
partial derivative of orders n = (ni, . . . , rid) and n\ = ri\\ • • • n^!, we have the 
following bound for partial derivatives of r at any u G B$ o (0), 



D n r(u) 



where R = p/ (a\fd). Based on the inequalities (4.8) and (4.10), we can construct 
an e-covering set of as follows. 
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Set a = p/(25 Vd), then R < 2S . Since M C [0, 1] D , with C 2 defined 
in Lemma 3.6, let {pi, • • • ,Pm} be an i?/(2C2)-net in M for the Euclidean 
distance, and let M = [j i Bi be a partition of M in sets Bi, . . . , obtaining 
by assigning every x e M to the closest G {pi, . . . ,p m }. By (3.3) and Lemma 
3.6 

WT\*)\<C2^=*<^ (4.11) 

where (j) Pi is the local normal coordinate chart at pi. Therefore, we can consider 
the piecewise transformed polynomials P = YliLi Pi.a^Bii with 



(4.12) 



<k 



Here the sum ranges over all multi-index vectors n — (ni, . . . , rid) G (IN U {0}) d 
with n. = ni + • • • + rid < k. Moreover, for y = (yi, ... , yd) G the notation 
?/ n used above is short for y^y^ 2 '''Vd d - We obtain a finite set of functions 
by discretizing the coefficients a^ n for each i and n over a grid of meshwidth 
e/i? n -net in the interval \-C/R n ^C/R n ) (by (4.10)). The log cardinality of this 
set is bounded by 



log(] #a i>n ) <mlog( 



<k 



<k 



2C/R r > 
e/R n 



< mk log 



2C 

e 



Since R = p/(aVd), we can choose m = N(M,\\ • 1 1, p/ (2C ad 1/2 )) ~ a d . To 
complete the proof, it suffices to show that for k of order log(l/e), the resulting 
set of functions is a if e-net for constant K depending only on p. 

For any function / G HJ, by Lemma 4.1, we can find a g G such that 
#|.M = /• Assume that r g (the subcript g indicates the dependence on g) is the 
local polynomial approximation for g defined as (4.9). Then we have a partial 



derivative bound on r g as: 



D n r g ( Pi ) 



< 



C_ 



Therefore there exists a universal constant K and appropriately chosen a$ in 
(4.12), such that for any z G Bi C M, 



D n r g (p i ) i 



n. >k 



i, m >k 



^ 00 /d-l /9\ fc 



l=k+l 



<k 



.<fc 



Moreover, by (4.8) and (4.11), 

\g(z) - r g (z)\ < Ca\\{^)-\zW < ^ < Ka~^ < Ke, 
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where the last step follows by the condition on a. 
Consequently, we obtain 

\f(z)-P iint (z)\ = \g(z)-Pi, nt {z)\ < \g(z)-r g (z)\ + \r g (z)-P hni (z)\ < Kc(^J +2Ke. 

This suggests that the piecewise polynomials form a 3if e-net for k sufficiently 
large so that (2/3) k is smaller than Ke. □ 

Similar to Lemma 4.6 in [31] , Lemma 4.4 implies an upper bound on — log P(\\ W a \\oo < 

Lemma 4.5. Assume that M is a d- dimensional compact C 1 submanifold of 
R D with 7 > 2. If K a is the squared exponential covariance function with inverse 
bandwidth a, then for some ao > 0, there exist constants C and eo that only 
depend on ao, \i, d, D and Ai, such that, for a > maxjao, e -1 ^ 7-1 ^} and 
e < e , 

-logP( sup \W*\ <e)< Ca d ( log - 
xeM \ e 

Before proving Theorem 2.1, we need another two technical lemmas for prepa- 
rations, which are the analogues of Lemma 4.7 and 4.8 in [31] for RKHS on 
Euclidean spaces. 

Lemma 4.6. For squared exponential covariance function, if a < b, then y/aMf C 
VbB.\. 

Proof For any / G y/aM^ by Lemma 4.1, there exists g G y/aMi such that 
g\ M = f. By Lemma 4.7 in [31], y^HI? C VbM\, so g G y/bE$. Again by Lemma 
4.1, since g\ M = /, ||/|| ib < ||^|| Hb < y/b, implying that / G VbU b v □ 

Lemma 4.7. Any h G satisfies \h(x)\ < 1 and \h(x) — h(x')\ < a\\x — x'\\r 
for any x, x' G M, where r 2 = j | |A| \ 2 dfi(X). 

Proof By the reproducing property and Cauchy-Schwarz inequality 

\h(x)\ = \(h,K a (x,.)) Aa \<\\K a (x,.)\\ Aa =l 
\h(x)-h(x')\ = \{h,K a {x,.)-K a {x',.))^ a \ 
< \\K a (x,-)-K a (x',-)\\ Aa 



= a/2(1 - K a (x,x')). 

By the spectral representation K(x,x') = J e 1 ^'^ /j, a (d\) and the fact that /j, a 
is symmetric, 

2(1 - K a {x, x')) = 2 J (1 + i(A, x-x')- e < < A ' x - a! '>)/i a (dA) 
< \\x-x'\\ 2 J \\\\\ 2 » a (d\) 
= a 2 \\x - x'\\ 2 J ||A|| 2 M(rfA). 

□ 
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^.5. Posterior Contraction Rate of GP on Manifold 

We provide proofs for Theorem 2.1 and Theorem 2.2. 

Proof of Theorem 2.1. Define centered and decentered concentration functions 
of the process W a = (W ax '• x e M) by 

W{e) = -logP(\W a \ QO <e), 

W= - inf |N| a -logP(|Tnoc<e), 

/iGH-:|/i-/ |oo<e 

where I |oo = sup^^ \f(x)\ is the sup norm on the manifold M. Then P(|VF a |oo < 
e) = exp(— </>o (e)) by definition. Moreover, by the results in [14], 

P(\\W a - /olloo < 2c) > e~^\ (4.13) 

Suppose that /o G C S (.M) for some 8 < min{2,7 — 1}. By Lemma 4.5 and 
Lemma 4.2, for a > ao and e > Cmax{a~^~^,a~ s } = Ca _s , 

(/>% (e) < Da d + C A a d flog ^ J < ^ W log ^ . 

Since A d has a Gamma prior, there exists p, Ci, C2 > 0, such that Cia p exp(— D2d d ) < 
g(a) < C20P exp(— D2d d ). Therefore by equation (4.13), 

P(\\W A - /olloo < 2e) > P(||W A - /olloo < 2e, A G [(C/e) 1 / 5 , 2(C/e) 1 / 5 ]) 

,2(C/e) 1 / s 

> / e^/o^a^a 

./(C/e) 1 /* 

> cie- K2(i/e)d/s(i ° g(i/e))i+d ^~y /s 1/s . 

Therefore, 

P(||^ A - /olloo <e n ) >exp(-ne^), 

for e n a large multiple of n~ s ^ 2s+d \\ogn) Kl with «i = (1 + d)/(2 + d/s) and 
sufficiently large n. 

Similar to the proof of Theorem 3.1 of [31], by Lemma 4.6, 



B M ,r,S,e = M 



^Hi + eBi) U ^ |J (MHJ) + eBi^ , 



with Bi the unit ball of C(M), contains the set MHJ + eBi for any a G [6,r]. 
Furthermore, if 



M > 4^/^(e) and e _ ^ (c) < 1/4, (4.14) 

then 

^ B) < 2 ° 2r - / + e" M (4.15) 

Z^d 
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By Lemma 4.5, equation (4.14) is satisfied if 

M 2 > 16C 4 r d (log(r/e)) 1+d , r > 1, e < e u 
for some fixed e\ > 0. Therefore 

P(W A iB)< exp(-C ne 2 n ), 

for r and M satisfying 

r d = ^nel M 2 = max{8C , 16C 4 }ne^ (log(r/e n )) 1+(i . (4.16) 

Denote the solution of the above equation as r n and M n . 
By Lemma 4.4, for > 2e and r > ao, 

logN(2e,M x fcw i + eBi, || ■ Hoc) < log TV f e, M, fe, 1 1 • || 



< Kr d lo ; 



. (5 
M^fr~/8 



By Lemma 4.7, every element of MHJ for a < S is uniformly at most S\/DrM 
distant from a constant function for a constant in the interval [— M, M]. There- 
fore for e > 5VDtM, 

log TV f3e, |J (Mi;) + dBi, || ■ ||oo) < N(e, [— M, M], | • |) < ™. 

With S = e/(2V^DrM), combining the above displays, for B = BM,r,5,e with 

M > e, M 3 / 2 V2rrD 1 / 4 > 2e 3/2 , r > a , 
which is satisfied when r = r n and M = M n , we have 

lo gW (3 e ,B, || • ||_) < AV^og ( M3/2 ^7 D ' /4 ))' +J + >og^. (4.17) 

Therefore, for r = r n and M = M„, 

logJV(3e n ,B,||-||oo) <ne 2 n , 
for e n a large multiple of e n (logn)^ 2 with = (1 + d)/2. □ 
Proof of Theorem 2.2. Under d! ', the prior concentration inequality becomes: 
P(||W^ - /olloo < 2e) > P(||W A - /olloo < 2e, A € [(C/e) 1 / 8 , 2(C/e) 1 / s ]) 



■/(C/e) 1 / 5 

> c 1 e-^( i ^ tivti ' /s ^( i / e )) i+d ^y /s ^y /s . (4.1? 
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The complementary probability becomes: 

P(W A iB)< 2 ° 2r - 6 + e" M / 8 , (4.19) 

D2 

with M 2 > 16C4r d (log(r/e)) 1+d , r > 1 and e < ei, where ei > is a fixed 
constant. 

An upper bound for the covering entropy is unchanged and still given by 
(4.17). 

1. d! > d: With e n a multiple of n- s ^ 2s+d '\\ogn)^ with «i = (1 + d)/(2 + 

d'/s), < Cn, 

2(° 

r d ' = -— %e 2 , and M 2 = max{8C , 16C 4 }ne 2 (log(r/e n )) 1+(i , 

inequalities (4.18), (4.19) and (4.17) become 

P(\\W A -f \\oo<e n )>eM-ne 2 n ), 

P(W A ^)<exp(-C ne 2 ), 
logJV(3e n ,B,|H|oo) <ne 2 . 

Comparing the above with (4.1), we arrive at the conclusion that under d' > d, 
the posterior contraction rate will be at least a multiple of n - s /(. 2s + d )(logn) K 
with k= (1 + d)/(2 + d'/s). 

2. 2^ < d! < d: With e n a multiple of n- s ^ 2sJrd \\ogn) Kl with k x = (1 + 

d)/(2+d/s),e n a multiple of ^/^^-^^'(logn)^ 1 )/ 2 = n" (2 &*+*)*' (\ogn) K2 
with k 2 = (d + d 2 )/(2d' + dd'/s) + (1 + d)/2, 

r d' = ^n6 2 n , and M 2 = max{8C , 16C 4 }ne 2 (log(r/e n )) 1+d , 

inequalities (4.18), (4.19) and (4.17) become 

P(||^ A -/o||oo<e n )>exp(-ne 2 ), 

P(W A iB) <exp(-C ne 2 ), 
logN^B^W-WJ) <ne 2 n . 
Comparing the above with (4.1), we arrive at the conclusion that under d! < d, 

_ (2s + d)d / -d 2 

the posterior contraction rate will be at least a multiple of n 2(2 S +d)d' (\ogn) K 
with k = (d + d 2 )/(2d' + dd' / s) + (1 + d)/2. To make this rate meaningful, we 
need (2s + d)d' - d 2 > 0, i.e. d! > d 2 / (2s + d). □ 

5. Numerical Example 

We provide a numerical example using the lucky cat data (Fig. 1). This data 
set has intrinsic dimensionality one, which is the dimension of the rotation 



Y. Yang et al./Bayesian Manifold Regression 



32 



Table 1 

Square root of MSPE for the lucky cat data by using three different approaches over 100 
random splitting are displayed. The numbers in the parenthesis indicate the standard 

deviations. 





n = 18 


n = 36 


n = 54 


EN 


.416(.152) 


.198(.042) 


.149(.031) 


LASSO 


.431(.128) 


.232(.061) 


.163(.038) 


GP 


.332(.068) 


.128(.036) 


.077(.014) 


2GP 


.181(.051) 


.124(.038) 


.092(.021) 


RPGP 


.340(0.071) 


.130(.039) 


.077(.015) 



angle 0. Since we know the true value of we create the truth fo(0) = cosO 
as a continuous function on the unit circle. The responses are simulated from 
Yi = fo(0i) + by adding independent Gaussian noises ~ 7V(0,0.1 2 ) to the 
true values. In this model, the total sample size TV = 72 and the predictors 
Xi G R p with D = 16, 384. To assess the impact of the sample size n on the 
fitting performance, we randomly divide n = 18, 36 and 64 samples into training 
set and treat the rest as testing set. Training set is used to fit a model and testing 
set to quantify the estimation accuracy. For each training size n, we repeat this 
procedure for m = 100 times and calculate the square root of mean squared 
prediction error (MSPE) on the testing set, 

E^E ^ ~ f ° m ^ 

i=i ieTi 

where T\ is the Ith testing set and Yi is an estimation of £"[y|Xi] = fo(6i). 
We apply three GP based algorithms on this data set: 1. vanilla GP specified 
by (2.3); 2. Two stage GP (2GP) where the D-dimensional predictors were 
projected into R 2 by using Laplacian eigenmap [3] in the first stage and then 
a GP with projected features as predictors was fitted in the second stage; 3. 
Random projection GP (RPGP) where the new predictors were produced by 
projecting the original predictors into R 1000 with a random projection matrix 
^ P = (Vij) € R 1000x16384 with ^ij - i.i.d. 7V(0, 1). To assess the prediction per- 
formance, we also compare our GP prior based models (2.3) with lasso [28] and 
elastic net (EN) [34] under the same settings. We choose these two competing 
models because they are among the most widely used methods in high dimen- 
sional regression settings and perform especially good when the true model is 
sparse. In the GP models, we set d = 1 since the sample size for this dataset 
is too small for most dimension estimation algorithms to reliably estimate d. In 
addition, for each simulation, we run 10,000 iterations with the first 5,000 as 
burn-in. 

The results are shown in Table. 1. As we can see, under each training size 
n, GP performs the best. Moreover, as n increases, the prediction error of GP 
decays much faster than EN and Lasso: when n = 18, the square root of MSPEs 
by using EN and lasso are about 125% of that by using GP; however as n in- 
creases to 54, this ratio becomes about 200%. Moreover, the standard deviation 
of square root of MSPEs by using GP are also significantly lower than those 
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Table 2 

Square root of MSPE for the lucky cat data with noised predictors, results over 100 random 
splitting are displayed. The numbers in the parenthesis indicate the standard deviations. The 
numbers after RPGP indicates the projected dimension d. 



o-x 





10 


20 


40 


80 


GP 
RPGP(IO) 
RPGP(IOO) 
RPGP(IOOO) 


.077(.014) 

.275(.065) 
.106(.023) 
.077(.015) 


.095(.015) 
.291(.069) 
.116(.026) 
.088(.017) 


.116(.017) 
.335(.075) 
.143(.033) 
.102(.018) 


.180(.020) 
.452(.085) 
.225(.043) 
.178(.021) 


.276(.23) 

.606(.102) 
.360(.065) 
.289(.033) 



by using lasso and EN. Among GP based methods, RPGP has slightly worse 
performance than GP under small training size, but as n grows to 54, they have 
comparable MSPEs. It is not surprising that 2GP has better performance than 
GP when n is small since the dimensionality reduction map ^ is constructed 
using the whole dataset (the Laplacian eigenmap code we use cannot do inter- 
polations). Therefore when the training size n become closer to the total data 
size 72, GP becomes better. In addition, GP is computationally faster than 2GP 
due to the manifold learning algorithm in the first stage of 2GP. 

To compare the performances between GP and RPGP in the case when there 
are noises in the predictors, we add iV(0, (JxId) noises into each predictor vector 
Xi with noise levels ax = 0, 10, 20, 40 and 80, where the range of predictors is 
~ 255. We also change the projected dimension d from 10 to 1,000. The 
training size n is fixed at 54. Table. 2 displays the results. 

As we can see, for small d = 10 or 100, applying GP on the original predictors 
appears to be better than RPGP on the projected predictors under any settings. 
As d grows to 1, 000, GP and RPGP have similar performances in the noise free 
setting. However, as noises are added to the predictors, RPGP with d = 1,000 
outperforms GP. However, as the noise increases to the order comparable to 
the signals, GP becomes close to and finally outperforms RPGP. In addition, 
the standard deviation of RPGP also grows rapidly as noise increases. This 
suggests that GP might be more stable than RPGP under small signal-to- noise 
ratio scenarios. 

6. Discussion 

In this work, we considered a nonparametric Bayesian prior for high dimensional 
regression when the predictors are assumed to be lying on a low dimensional 
intrinsic manifold. The proposed prior can be considered as an extension of a 
Gaussian process prior on Euclidean space to a general submanifold. We show 
that this GP prior can attain near optimal posterior convergence rate that can 
adapt to both the smoothness of the true function (s < 2) and the underly- 
ing intrinsic manifold A4. Our theorem validates the surprising phenomenon 
suggested by Bickel in his 2004 Rietz lecture [5] under the GP prior scenario: 

"... the procedures used with the expectation that the ostensible dimension D is correct 
will, with appropriate adaptation not involving manifold estimation, achieve the optimal 
rate for manifold dimension d." 
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Moreover, we also provide theoretical guarantees for two stage GP with di- 
mensionality reduction. We suggest the use of random projection GP as a special 
two stage GP when noises exist in the predictors. 

One possibility of our future work is to investigate whether the smoothness re- 
quirement s < 2 could be relaxed. This extension will be dependent on whether 
Lemma 4.2 could be improved to s > 2. Currently we construct the approxima- 
tion function I a (f) in RKHS through convolving / with the covariance function. 
It is not clear whether this is the best way to approximate the function / by 
elements in the RKHS. 

A second possibility is to build a coherent model not only estimating the 
regression function i£[Y|X], but simultaneously learning the dimensionality d 
of the intrinsic manifold A4. Our current GP prior (2.3) completely ignores the 
information contained in the marginal distribution Px of the predictor X. As 
an alternative, we can only model part of Px and therefore utilize some of Px's 
information, such as the support or dimensionality of A4. 
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